You want to build a word cloud, an infographic where the size of a word corresponds to how often it appears in the body of text.
To do this, you'll need data. Write code that takes a long string and builds its word cloud data in a dictionary , where the keys are words and the values are the number of times the words occurred.
Think about capitalized words. For example, look at these sentences:
'After beating the eggs, Dana read the next step:'
'Add milk and eggs, then add flour and sugar.'
What do we want to do with "After", "Dana", and "add"? In this example, your final dictionary should include one "Add" or "add" with a value of . Make reasonable (not necessarily perfect) decisions about cases like "After" and "Dana".
Assume the input will only contain words and standard punctuation.
You could make a reasonable argument to use regex in your solution. We won't, mainly because performance is difficult to measure and can get pretty bad.
In our solution, we make three decisions:
To split the words in the input string and populate a dictionary of the unique words to the number of times they occurred, we:
If the input word is uppercase and there's a lowercase version in the dictionary, we increment the lowercase version's count. If the input word is lowercase and there's an uppercase version in the dictionary, we "demote" the uppercase version by adding the lowercase version and giving it the uppercase version's count.
class WordCloudData:
def __init__(self, input_string):
self.words_to_counts = {}
self.populate_words_to_counts(input_string)
def populate_words_to_counts(self, input_string):
# Iterates over each character in the input string, splitting
# words and passing them to add_word_to_dictionary()
current_word_start_index = 0
current_word_length = 0
for i, character in enumerate(input_string):
# If we reached the end of the string we check if the last
# character is a letter and add the last word to our dictionary
if i == len(input_string) - 1:
if character.isalpha():
current_word_length += 1
if current_word_length > 0:
current_word = input_string[current_word_start_index:
current_word_start_index + current_word_length]
self.add_word_to_dictionary(current_word)
# If we reach a space or emdash we know we're at the end of a word
# so we add it to our dictionary and reset our current word
elif character == ' ' or character == '\u2014':
if current_word_length > 0:
current_word = input_string[current_word_start_index:
current_word_start_index + current_word_length]
self.add_word_to_dictionary(current_word)
current_word_length = 0
# We want to make sure we split on ellipses so if we get two periods in
# a row we add the current word to our dictionary and reset our current word
elif character == '.':
if i < len(input_string) - 1 and input_string[i + 1] == '.':
if current_word_length > 0:
current_word = input_string[current_word_start_index:
current_word_start_index + current_word_length]
self.add_word_to_dictionary(current_word)
current_word_length = 0
# If the character is a letter or an apostrophe, we add it to our current word
elif character.isalpha() or character == '\'':
if current_word_length == 0:
current_word_start_index = i
current_word_length += 1
# If the character is a hyphen, we want to check if it's surrounded by letters
# If it is, we add it to our current word
elif character == '-':
if i > 0 and input_string[i - 1].isalpha() and \
input_string[i + 1].isalpha():
current_word_length += 1
else:
if current_word_length > 0:
current_word = input_string[current_word_start_index:
current_word_start_index + current_word_length]
self.add_word_to_dictionary(current_word)
current_word_length = 0
def add_word_to_dictionary(self, word):
# If the word is already in the dictionary we increment its count
if word in self.words_to_counts:
self.words_to_counts[word] += 1
# If a lowercase version is in the dictionary, we know our input word must be uppercase
# but we only include uppercase words if they're always uppercase
# so we just increment the lowercase version's count
elif word.lower() in self.words_to_counts:
self.words_to_counts[word.lower()] += 1
# If an uppercase version is in the dictionary, we know our input word must be lowercase.
# since we only include uppercase words if they're always uppercase, we add the
# lowercase version and give it the uppercase version's count
elif word.capitalize() in self.words_to_counts:
self.words_to_counts[word] = 1
self.words_to_counts[word] += self.words_to_counts[word.capitalize()]
del self.words_to_counts[word.capitalize()]
# Otherwise, the word is not in the dictionary at all, lowercase or uppercase
# so we add it to the dictionary
else:
self.words_to_counts[word] = 1
Runtime and memory cost are both . This is the best we can do because we have to look at every character in the input string and we have to return a dictionary of every unique word. We optimized to only make one pass over our input and have only one data structure.
We haven't explicitly talked about how to handle more complicated character sets. How would you make your solution work with more unicode characters? What changes need to be made to handle silly sentences like these:
I'm singing ♬ on a ☔ day.
☹ + ☕ = ☺.
To handle capitalized words, there were lots of heuristics and approaches we could have used, each with their own strengths and weaknesses. Open-ended questions like this can really separate good engineers from great engineers.
Good engineers will come up with a solution, but great engineers will come up with several solutions, weigh them carefully, and choose the best solution for the given context. So as you're running practice questions, challenge yourself to keep thinking even after you have a first solution. See how many solutions you can come up with. This will grow your ability to quickly see multiple ways to solve a problem, so you can figure out the best solution. And use the hints and gotchas on each Interview Cake question—they're designed to help you cultivate this skill.
Do you have an answer?
Wanna review this one again later? Or do you feel like you got it all?
Mark as done Pin for review laterReset editor
Powered by qualified.io